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Abstract 

We present HORIZON, a new graphics processing unit (GPU)-accelerated code to solve the 
equations of general relativistic magnetohydrodynamics in a given spacetime. We evaluate the 
code in several test cases, including magnetized Riemann problems and rapidly rotating neu- 
tron stars, and measure the performance benefits of the GPU acceleration in comparison to our 
CPU-based code THOR. We find substantial performance gains in comparison to a quad-core 
CPU both in single- and double-precision accuracy, and discuss these findings in the context of 
future numerical modeling efforts. 

Subject headings: magnetohydrodynamics (MHD) — methods: numerical 



1. Introduction 

Computational models of general relativistic 
flows are an important tool to understand the dy- 
namics of compact objects, more so since gravita- 
tional wave detectors like LIGO, VIRGO, TAMA 
or GEO600 are becoming operational. To facil- 
itate interpretation of data once they have been 
obtained, the inverse problem of determining 
source parameters can be studied with the help 
of simulated models. 

General relativistic magnetohydrodynamics 
(GRMHD) is the study of magnetized flows un- 
der relativistic velocities and in very strong grav- 
itational fields, and it is therefore ideally suited 
for modeling compact objects and their environ- 
ments (for a review see [Fontl < |2008) ). A number 
of GRMHD codes have been developed in re- 
cent years, see Koide et al. (|1999), De Villiers & 
IHawleyl^OOS^jGammie et al.|(|2003 ,|Komissarov 
(|2004), |Ant6n et al.| ([2006), |Mizuno et al.| <|2 006 1, 



Anderso n et al.| ^2006^ 



Giacomazzo & Rezzolla 



1 2007} , IDeTZanna et al. 
2008 1, ICerda-Duran et al. 



(|2007||, |Yamamoto et aL 
<|2008|) " 



Bucciantini 



& Del Zanna| < |2010) and |Etienne et al.| < |2010) ." 
Most of these approaches employ finite-volume 
methods on structured meshes, and use high- 



resolution shock capturing schemes for handling 
the interface fluxes. Therefore, they are ideally 
suited for highly compressible flows with shocks, 
and enjoy conservative properties which also en- 
sure numerical stability. 

Since these models are computationally expen- 
sive, not least due to their proper account of ul- 
trarelativistic motions and the general covariant 
form of the conservation laws, there is a strong 
impetus to explore the use of graphics processing 
units (GPUs) to accelerate the calculations. GPUs 
offer a substantially higher peak performance 
than traditional central processing units (CPUs) 
PGrk & Hwul|2010) , and are better suited for the 
high throughput, data-parallel problems typical 
of large-scale simulations. Therefore, these ar- 
chitectures have become very popular in recent 
years, especially since vendors like NVIDIA have 
started to target the high-performance comput- 
ing market with specific hardware features and 
appropriate software tools. 

Graphics processing units have traditionally 
been designed for computer graphics algorithms. 
Most of these algorithms are data-parallel, i.e. the 
same operations are independently performed 
on a large stream of data. In contrast to central 
processing units, which are optimized for exe- 
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cuting single threads rapidly, GPUs, as a data- 
parallel architecture, trade single-thread perfor- 
mance for massive parallelism. While CPUs 
perform computations on a few cores, current- 
generation GPUs may utilise hundreds or even 
over a thousand compute cores on a chip. 

It is probably fair to say that GPUs do not 
only offer a much higher realizable performance 
at present, but also share architectural features 
which should become more prevalent in any 
hardware optimized for high throughput, data- 
parallel computations ( |Garland & Kirk 20101. 
One reason is given by power and cooling con- 
siderations. Traditionally, commodity CPUs have 
gained speed by raising the frequency of opera- 
tions, which gave rise to higher power consump- 
tion and more challenging cooling requirements. 
This approach cannot be scaled indefinitely, and 
therefore CPU designs have reached a power wall. 
A second factor is the memory wall: increasing 
latencies in accessing main memory are covered 
in CPUs by using large caches and sophisticated 
control logic, and the relative benefits of these ap- 
proaches decrease with higher integration of the 
circuits. 

GPUs use massive parallelism in place of large 
caches and expensive control logic, hiding mem- 
ory access latencies with high-throughput com- 
putation on large streams ( Garland & Kirk|2010) . 
Since more transistors are devoted to compute 
units compared to CPUs, the peak performance 
of GPUs (and similar architectures) is expected 
to increase substantially faster than in CPUs. Be- 
cause of this, high-performance computing will 
greatly benefit from GPUs or similar designs in 
the future. 

Attempts to use GPUs for scientific applica- 
tions have increased rapidly with the recent in- 
troduction of NVIDIA's CUDA language. Early 
adopters of this approach in astrophysics in- 
clude |Portegies Zwart et aL|(|2007),|Belleman et al. 
( 200 8j,|Zink|(|2008|l, andlSchive et al.|<|2008|>.|Porte- 
gies Zwart et al.|(|2007) , |Belleman et al.| < |2008[ > and 
Schive et al. (20081 were concerned with N-body 
calculations (also using GRAPE, another paral- 
lel architecture), whereas Zink ( 2008) performed 
experiments on the general relativistic field equa- 
tions. |Barsdell et aT1 ( |20To) and |Fluke et"aT] ( |20To"} 
discuss future uses of GPUs in astrophysical sim- 
ulations, and Choudhary et al. ( |2010) consider 



applications to numerical relativity. Herrmann 
et al.| ( |2010) apply CUDA to the post-Newtonian 



equations describing the binary-black hole prob- 
lem, and Khanna & McKennon < |2010[ ) investigate 
extreme-mass ratio inspirals. Specifically in the 
context of magnetohydrodynamics simulations, 
Pang et al. ( |2010) compare the performance of dif- 



ferent hardware architecture s. |Wong et al.| (2009), 
Wang| ( |2009) and |Wang eTaT] ( |2010) present imple- 



mentations of (Newtonian) MHD in CUDA, and 
observe very promising performance relative to 
CPU implementations. 

In this paper, we present an implementation 
of general relativistic magnetohydrodynamics on 
graphics processing units. The HORIZON code is 
a standalone CUDA/ C++ application which is in 
part based on the GRMHD code THOR ( |Zink et al.' 
|2008f|Korobkin et al.||2010j ). In the following, we 
will first briefly describe the physical system and 
the finite-volume method in section [2} and then 
present the HORIZON code in section [3] In sec- 
tion |4j we evaluate the relative performance gain 
afforded by the GPU implementation compared 
to the CPU code. Afterwards, in section[5] we illu- 
minate how differences between single and dou- 
ble precision arithmetics affect simulation results 
in a number of test cases. We conclude with a dis- 
cussion in section 

2. Physical system and numerical method 
2.1. Evolution system 

The domain of HORIZON are simulations of 
magnetized flows in general relativistic astro- 
physics. Similar to other codes in this field, e.g. 
THOR (|Zink et al.| [2008) WhiskyMHD (Gia- 
comazzo & Rezzolla| 12007) , Co CoNuT (jCerda- 
Dur an et al.||2008[ l and Sacra ( |Yamamoto et al. 
2008), this allows to investigate the properties 
and dynamics of ultra-compact sources of gravi- 
tational radiation, including neutron stars, mag- 
netars, and accretion flows around black holes. 

In this paper we will only account for rela- 
tivistic flows on a fixed spacetime background, 
which is known as the Cowling approximation in 
the context of stellar oscillations. Given such a 
spacetime metric gy V , the conservation laws and 
Maxwell's equations are (Font 2008): 
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Here, = pu) 1 is the rest-mass density current, 
T^ v is the energy-momentum tensor of the fluid 
and the Maxwell field, J v is the electric four- 
current, Ft' v is the Faraday tensor, and *F^' V is its 
Hodge dual. 

We will assume the ideal MHD approximation 
to be valid in the following. Using a 3+1 split of 
Einstein's field equations, and an associated de- 
composition of the GRMHD equations, we arrive 
at the conservation form ( |Noble et al.|2006| 
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In these equations, we have used the magnetic 
field B i = *F U , the Christoffel symbols T A }tK , and 
the magnetic four-vector W is given in terms 
of the spacelike 3-hypersurface normal n'' as 
W = n K *F VK (6f v + ut l u v )/n K u K . We identify the 
eight conservative quantities yj—g]*, y , -jT'« 
and \/—gB l which form the basis of the finite- 
volume method. 

Initial data for the evolutions reported later 
will be either given analytically (for Riemann 
problems) or will be imported from the publicly 
available RNS code (Stergioulas & Friedman 1995) 
for the construction of rapidly rotating neutron 
star models. 

2.2. Overview of the numerical method 

We discretize the system of conservation laws 
eqns. l[2j with the use of a finite volume method 
and approximate Riemann fluxes at cell inter- 
faces. This scheme has shock-capturing prop- 
erties and is implemented in THOR and other 
GRMHD codes. 

The system of conservation laws (|2| can be 
written in the compact form 



dtW + dif l {u(w)) — s(u(w)) 



(3) 



where w is the tuple of conserved variables 
{yf—g]* and so on), u is the tuple of primitive 
variables (p and so on), and f l is the flux vector. 
For the finite volume approach, we discretize the 
coordinate domain into a regular mesh of cells, 
and, for each cell, cast the conservation laws into 
a weak form by volume integration. 

The evolution starts with converting the initial 
data, which is commonly given in terms of the 
primitive variables, into the conserved variables. 
The relation w(u) is algebraic and can therefore 
be easily evaluated in each cell. For a time up- 
date, we need to evaluate the (time and area aver- 
aged) cell face flux vectors f . These are obtained 
from assuming a local Riemann problem at the 
center location of the face, and calculating the flux 
function appropriate for this problem. 

The initial data for the face Riemann problems 
are obtained from the cell center data via a re- 
construction. To avoid unphysical oscillations in 
the reconstructions these approximants are con- 
strained by additional requirements (TVD and 
similar) which lead to extended reconstruction 
stencils compared to Lagrangian interpolants of 
the same order. Each cell interface needs data re- 
constructed at its left and right side to define the 
local Riemann problem. 

After the initial state for the Riemann problem 
has been obtained, the appropriate flux is deter- 
mined via a local operator on each face. Since 
the full Riemann problem is rather expensive to 
solve (and even more so when magnetic fields 
are included ( jGiacomazzo & Rezzolla|2006| ), we 
employ an approximate Riemann solver which 
only requires the values of the flux function from 
each (left and right) reconstructed primitive state, 
and the approximate maximal wave speeds asso- 
ciated with them. 

When all face fluxes are known, we calculate 
the cell update from the effective flux per cell, 
and add the local source term contributions. We 
then have the new values of the conserved vari- 
ables w, but to be able to continue with the next 
operation we will also need consistent primitive 
variables u. The inversion u(w) is, unfortunately, 
not algebraic in general, so we employ a Newton- 
Raphson method to obtain the primitive state. 
This concludes a (sub-)step of the Runge-Kutta 
time advancement. 
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3. The Horizon code 

3.1. Suitability of GPU architectures for gen- 
eral relativistic magnetohydrodynamics 

Since the computational operations in the 
finite-volume model described above are local 
(inside the support of a small stencil of only a few 
cells), the problem is naturally data-parallel via 
domain decomposition. This fact has been used 
extensively in parallelizing GRMHD codes to 
distributed-memory architectures via MPI, such 
that each processor operates on a compact sub- 
set of the full domain, e.g. a block of 40 3 cells. 
Boundary information is then shared via syn- 
chronization over ghost cells, and every time new 
data is needed for reconstruction, MPI messages 
are exchanged between (geometrically) adjacent 
processes. Then, each process operates indepen- 
dently on its block using local loops Q 

This type of coarse-grained block decomposi- 
tion, while data-parallel, is not sufficient for using 
graphics processing units effectively. The num- 
ber of processes in a typical MPI simulation will 
number in the tens or hundreds in many cases, 
whereas GPUs require tens or hundreds of thou- 
sands of threads to be saturated. The reason is 
that the simpler control logic and lower number 
of transistors devoted for advanced intermediate 
caches exhibit typically very high memory access 
latencies (in the order of several hundred cycles), 
which are hidden in the data parallel model by 
oversubscribing hundreds or thousands of com- 



pute cores with many more threads (see Garland 
|& Kirk| ( |2ul0) for details). 

Therefore we will employ a fine-grained method 
of parallelism: here, each compute core on the 
GPU operates either on a single cell or a small 
set of cells. Given that a simulation may employ 
millions of cells in total, domain decomposition 
performed in this way can easily provide enough 
threads to saturate a GPU. If several GPUs are 
used for a larger simulation, a natural approach is 
to combine a coarse-grained decomposition into 
blocks as before, and then perform fine-grained 
parallel computation on each GPU. 

1 This method can potentially be extended by using OpenMP 
on a cluster node, thereby parallelizing intra-node operations 
not via MPI but using the implicit communication employed 
by OpenMP. This approach can be employed to reduce the 
memory overhead involved with layers of ghost cells. 



Synchronization on the level of the GPU, 
which in itself is a shared-memory architecture, is 
easy to perform by executing independent com- 
pute kernels, and then using barrier statements 
to ensure that the global memory state on the 
device is consistent. Between GPUs, MPI syn- 
chronization using ghost cells can in principle be 
used as before, with the added complication that 
current MPI libraries cannot directly access the 
GPU memory, such that the transfer of ghost cells 
from the device memory to main memory (and 
back) must be performed before and after each 
MPI exchange. 

Having addressed the matter of how to gener- 
ate enough GPU threads and combine GPUs par- 
allelism with MPI parallelism, a more challenging 
question concerns the kind of performance gains 
we can expect. Producing enough threads is easy 
in a naturally data-parallel problem of large size, 
but this is not the only important consideration as 
far as performance is concerned. 

The reason many threads are employed is to 
hide main memory access latencies, as mentioned 
above. However, to achieve high performance 
the number of arithmetic (floating-point) oper- 
ations in relation to main memory access op- 
erations, the so-called arithmetic intensity, must 
be high, so that the thread scheduler can run 
register-based operations in certain threads while 
other threads are still waiting for data. This can 
be a problem for simple operations like matrix- 
vector multiplication (extensively used e.g. in 
some elliptic solvers) and even simple Newto- 
nian fluid dynamics codes, since the kernels then 
become limited by memory bandwidth, imply- 
ing that processing cores are often idle waiting 
for data. 

The theoretical memory bandwidth of a GPU 
is much higher than for a CPU, but it is only 
available if the programmer is very careful with 
data alignment and observes so-called coalesc- 
ing rules while reading from memory. The lat- 
ter refers to arranging memory operations within 
several threads in a particular way, such that 
load and store operations can be performed in 
a faster manner. Ignoring the (at times rather 
arcane) rules of memory access coalescence can 
significantly impact the available memory band- 
width (down to l/10th in some architectures), 
and with it the available speedup for any appli- 
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cation which is bandwidth-limited. 

These coalescing rules are one of the main rea- 
sons GPU programming is known to be challeng- 
ing. In the field of (Newtonian) computational 
fluid dynamics, which often has a low arithmetic 
intensity and, in addition, often needs to operate 
on unstructured meshes, efficient compute ker- 
nels are a major challenge for software develop- 
ers. Some CFD methods, however, lend them- 
selves to higher arithmetic intensities, e.g. Dis- 
continuous Galerkin techniques, and are there- 
fore considered for GPU applications (Klockner 
et al.|2009| . 

Fortunately, this is precisely where GRMHD 
is different. The relativistic equations of mo- 
tion contain many more operations per cell than 
is common for Newtonian finite-volume codes, 
even considering that more variables are needed 
to store the spacetime metric. The higher arith- 
metic intensity of GRMHD reduces the relative 
pressure on memory bandwidth from the com- 
pute kernels, and therefore can make use of a rel- 
atively larger portion of the (very high) compute 
performance of the GPU without excessive con- 
siderations of coalescing rules. The fact that re- 
cent architectures offer automatic caches makes 
the situation even more interesting for general- 
relativistic numerical codes. 

There are additional performance considera- 
tions for writing fast GPU codes. Beside the mem- 
ory coalescence, another important factor con- 
cerns the so-called block size. Current GPUs have 
a SIMD (single instruction multiple data) archi- 
tecture also on the hardware level, in the sense 
that a number of compute cores are arranged into 
multiprocessors with a common instruction de- 
coder. In practice, the GPU operates on a set of 
threads (a warp) in a single step, necessitating the 
exact same instruction to be performed up to the 
operands. The programmer can still use condi- 
tional statements for convenience (in contrast to 
explicit vector instructions), but if the command 
unit detects a divergence within a warp, the num- 
ber of different code paths are serialized, with a 
comparable loss of available peak performance. 
For current architectures, this can be a factor of 
up to 32. 

From the warp size, it is therefore desirable to 
have precisely common (essentially vectorized) 
execution parts at least as far as multiples of 



32 are concerned. This is a concern with stan- 
dard GRMHD in as far as the recovery of prim- 
itive variables are concerned, since the Newton- 
Raphson method may easily produce divergent 
execution paths depending on the cell data. How- 
ever, in all experience from CPU simulations we 
do not expect this part of the code to be important 
for the overall speedup (even when considering 
Amdahl's law ( |Amdahl| |T967|), and in addition 
the loss of performance from warp serialization 
competes with other possibly limiting factors. It 
is however important to arrange kernels to exe- 
cute threads in a multiple of the warp size to use 
all cores effectively. 

Another factor affecting performance also has 
to do with the way threads are distributed onto 
different multiprocessors. Each multiproces- 
sor (remember that the GPU contains several 
such multiprocessors) contains fast local mem- 
ory, which provides cache storage, but also space 
for the registers used for floating point opera- 
tions. If several blocks of threads are assigned to 
a multiprocessor, they have to share the common 
memory resources. In adverse cases, this could 
lead to register spilling, where operands have to be 
"cached" into the device's main memory, thereby 
strongly reducing available performance. Fortu- 
nately, it is possible to find a good thread config- 
uration for each compute kernel simply by a set 
of small experiments. 

Even the shared memory accesses can have 
varying performance since the memory is orga- 
nized into banks, and random access to data can 
lead to bus conflicts and corresponding serializa- 
tion. This, however, is often only a minor con- 
sideration when compared to other factors affect- 
ing performance, and recent architectural devel- 
opments for GPUs have further decreased its rel- 
evance. 

We have, so far, left out one of the likely most 
important influences on GPU performance in cur- 
rent architectures: the differences between sin- 
gle and double precision floating point opera- 
tion cost. Historically, GPUs have evolved in 
the context of graphics applications, which typ- 
ically do not require more than single-precision 
accuracy even in demanding situations. Very re- 
cent GPU architectures, which have been devel- 
oped in recognition of the potential use of GPUs 
in scientific, engineering and financial environ- 
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merits, have added support for double-precision 
operations, albeit at a higher cost in terms of 
clock cycles. In addition, bandwidth require- 
ments for loading and storing double-precision 
numbers are naturally higher. The most recent 
development in this direction is the Fermi archi- 
tecture by NVIDIA which reduces the clock cycle 
difference between single and double precision 
operations. 

While the ratio between single and double pre- 
cision performance may be reduced in future ar- 
chitectures, at present it is uncertain how this dif- 
ference will affect GRMHD codes in practice. It 
is quite possible that many operations in a scien- 
tific application do not actually require double- 
precision accuracy, and only a select few opera- 
tions, particularly those involving the (badly con- 
ditioned) subtraction of similar but large num- 
bers, should be cast to double precision floating 
point numbers. If the areas of code where this 
is necessary can be identified, and if those are 
converted to double precision, the overall per- 
formance of the code may well be close to the 
one for single precision, while the overall level 
of accuracy should be close to the double preci- 
sion result. The effectiveness of such a hybrid ap- 
proach will depend on the numerical system. For 
GRMHD, we expect most operations to be well- 
conditioned, though in particular the transforma- 
tion of conserved to primitive variables may give 
rise to difficulties in some situations when using 
single-precision floating point numbers (in fact, it 
could be argued that this statement also applies 
to double precision arithmetics). 

Overall, we expect GRMHD to map very well 
to GPU architectures, due to its massively data- 
parallel nature, its mostly regular memory ac- 
cess patterns (which facilitate caching), and its 
comparatively high arithmetic intensity resulting 
from the general relativistic set of equations. 

3.2. Horizon 

The HORIZON code is an GPU implemen- 
tation of general relativistic magnetohydrody- 
namics using the discretization described in sec- 
tion 2.2 It supports two-dimensional and three- 



tion from conserved to primitive variables, can 
be optionally performed exclusively in double 
precision, even if the rest of the calculation is 
done with single precision numbers. The code 
employs HLLE (Harten, Lax, van Leer, Einfeldt) 
fluxes ( Harte n" et al.||l9 83l, and a choice between 
TVD reconstruction with an MC limiter, and PPM 
reconstruction (see |Martr & Miiller (1999]) and 
Font ( 2008) for details). Primitive variables are 
recovered using either the lDp or 2Dyy meth- 



ods from Noble et al. 1 2006 1, and time is ad 



vanced via a third-order Runge-Kutta method. 
A hyperbolic divergence cleaning scheme ( |An 



derson et al.| 2006) is available for damping vi- 



dimensional meshes, and both single and double- 
precision floating point accuracy calculations. 
Particular kernels, in particular the transforma- 



olations of the relativistic solenoidal constraint 

3. (>/=P0 = 0- 

A practical consideration when writing code 
for GPU architectures is the choice of parallel pro- 
gramming language. Standard Fortran 90 or C 
can not be used for writing GPU code, but there 
are a number of language options available for 
data-parallel and stream programming problems. 
A very popular choice is NVIDIA's CUDA for 
C, which is an extension of C/C++ to accommo- 
date compute kernels and functional program- 
ming on the device. In practice, CUDA uses stan- 
dard C++ code enriched by a number of addi- 
tional keywords which identify functional parts 
to be executed on the GPU. Host (CPU) code is 
extracted from the source and compiled with a 
native C++ compiler, e.g. GNU g++, whereas de- 
vice (GPU) code is transformed into a special in- 
termediate language (similar to assembly) which 
is transformed for execution by a runtime library. 
From the programmer's point of view, these steps 
are all transparent, which makes CUDA a partic- 
ularly easy choice in practice. 

The disadvantage of using CUDA is that it is 
a proprietary language which is only compatible 
with NVIDIA GPUs. An alternative is OpenCL, 
which is designed as an industry standard to re- 
place proprietary solutions, but at this stage it 
is less mature, less convenient to use, and not 
quite as well supported by the hardware drivers. 
In addition, there are a number of readily avail- 
able numerical and parallel processing libraries 
for CUDA which are not ported yet to OpenCL. 
Overall, developing scientific code is much more 
straightforward with CUDA at this time, which 
also explains its high popularity in a number of 
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scientific computing fields. 

HORIZON is written in CUDA, and there- 
fore based on object-oriented programming via 
C++. In particular, in contrast to THOR, it is not 
part of the CACTUS infrastructure, but is a stan- 
dalone application code. It employs a Cartesian, 
uniform-mesh finite-volume discretization of 
GRMHD, distributed into host-side code for ini- 
tialization, memory allocation and input/ output, 
and CUDA compute kernels for the mathematical 
operations. 

The compute operations for evolutions are en- 
tirely performed on the CUDA device. This must 
be stressed, since a simple model for accelerating 
an algorithm could be to first copy data to the 
GPU, perform computations, copy back to host 
memory, and repeat these steps. However, mem- 
ory transfers between host memory and the GPU 
device (which use the PCI-Express bus) are rather 
slow compared to the GPU kernel execution time 
and available memory bandwidth, so constantly 
copying data would severely limit the possible 
level of acceleration. In fact, there have been cases 
where only the kernel execution times have been 
used to measure speedups. HORIZON runs en- 
tirely within the GPU, and therefore the speedup 
mentioned here are what a practitioner would ac- 
tually receive in terms of overall execution speed 
of the entire application (when no output is con- 
sidered). This is sometimes also called application 
speedup. 

3.3. Particular implementation aspects 

In the following, we will mention a few partic- 
ular choices we have made to obtain high overall 
performance in HORIZON. Even though GRMHD 
is a very suitable system for GPU acceleration, 
care must still be taken in order to achieve high 
performance gains: see also Zink ( |2008) for a 
more detailed exposition in the context of the 
ADM equations. 

3.3.2. Memory layout and data alignment 

We have discussed the issue of coalesced mem- 
ory access above. While GRMHD has a high 
arithmetic intensity, it is still very important to be 
aware of the major performance considerations 
concerning coalesced memory access when writ- 
ing compute kernels. While we will not be able to 



discuss the individual choices made for each ker- 
nel here, we would like to point to two optimiza- 
tions which are rather general and should apply 
directly to similar implementations. 

The first choice is to approximately satisfy 
the alignment requirements for coalesced mem- 
ory access by using properly "padded" memory 
blocks for three-dimensional meshes. CUDA 
supplies special functionality to automatically 
align memory blocks in an appropriate way, 
which requires also a slightly different address- 
ing pattern inside the kernels. In cases where the 
grid size is not a multiple of the pad size defined 
by the device's coalescing rules, this will lead to 
a slight memory overhead, but also to increased 
performance. 

The second general consideration is to use 
blocks of threads which are properly aligned to 
allow coalesced accesses to the padded grid func- 
tion arrays. The SIMD nature of a multiproces- 
sor already prescribes the threads per block to 
be a multiple of the warp size to avoid warp di- 
vergence, but in addition coalesced access needs 
to read continuous blocks of padded data words 
using subsequent threads. Therefore, the access 
pattern employs subsequent threads in the x di- 
rection by a multiple of the warp size, since the 
data layout is consecutive there. This also fa- 
cilitates more effective caching. Each block of 
threads then operates on a slab of cells in x and 
y direction, e.g. (32,2) or (32,4), and performs a 
loop over the z direction. 

3.3.2. Use of automatic caches 

In older GPU architectures, no automatic 
caches were available, and therefore a set of mem- 
ory shared between threads inside a multipro- 
cessor could be used to provide local data. The 
advantage is that (fast) coalesced memory loads 
can be employed to stage this cached data, and 
then operate almost randomly on the shared data 
inside the compute kernel. 

Effectively, the shared memory acted as a 
cache, but the programmer needed to stage and 
remove data manually, with sometimes substan- 
tial (and often not very obvious) impact on the 
code's performance. This manual caching, while 
still available, has been enhanced by the option 
to use automatic caches in NVIDIA's Fermi ar- 
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chitecture. The programmer then has the option 
to devote more shared memory either to manu- 
ally controlled storage, or rather to the automatic 
cache, depending on the application problem and 
programming style. 

HORIZON exclusively uses the automatic cache 
when it is available on the GPU. This has simpli- 
fied the implementation considerably, and leads 
to very good performance results. While it is 
possible that manually optimized kernels using 
specifically staged shared data could outperform 
our current implementation, we consider the ef- 
fort not worth the possible effect. 

3.3.3. Flux calculation and local arrays 

The flux calculation is very similar in the x, y 
and z direction. Because of this, one may consider 
to use temporary arrays (and code) for flux cal- 
culation in only one direction, but stage data by 
a map from 3D arrays into the temporary arrays 
(which have a different data layout). Alterna- 
tively, a straightforward direction-dependent flux 
calculation, which has a very different memory 
access pattern, can be used. 

The advantage of one method over the other 
is not obvious: In the former case, compared to 
the latter, more data needs to shifted around, but 
the resulting (expensive) flux calculations can be 
done in a manner which is more consistent with 
the rules for coalescence. We have experimented 
with both methods, but found the second ap- 
proach to be superior on current Fermi architec- 
tures. 

3.3.4. Global reductions 

A practical implementation of a GRMHD code 
requires at some stage to perform different kinds 
of global operations, e.g. to approximate integrals 
and build different norms. Many of these oper- 
ations could be done in post-processing on the 
CPU, but we have opted to also support global re- 
ductions on the GPU side. This is obviously more 
complicated than in a serial code, since a paral- 
lel reduction can not be performed via a simple 
loop due to data dependencies. For clusters, MPI 
libraries provide ready-made parallel reduction 
operations, but CUDA or other GPU languages 
do not. The programmer can decide to construct 
a simple hierarchical reduction scheme manually, 



but it is easier to rely on research on very efficient 
implementations of reductions, parallel scan and 
similar operations in the form of parallel libraries. 
For CUDA, both the CUDPP and the THRUST li- 
brary are available. In HORIZON we employ the 
THRUST library, a C++ template implementation 
which offers a number of additional implementa- 
tion benefits. 

3.3.5. MPI parallelization 

In order to make use of clusters of GPUs, we 
have added MPI support to HORIZON. The ap- 
proach is a standard ghost-zone synchronization 
method which is also employed in CACTUS and 
THOR, where we extend the computational do- 
main by a certain number of cells as appropri- 
ate for the numerical stencil of our scheme. The 
global domain is divided into smaller blocks of 
comparable size, and each GPU in the cluster is 
assigned to an individual block. In the synchro- 
nization step, we first prepare device-side buffers 
to hold the internal mesh information close to 
the boundaries (i.e. the information which needs 
to be transferred to the ghost cells of an adja- 
cent domain block), and invoke a kernel to copy 
mesh data to these buffers on each GPU. Then, 
the buffers, which are arranged in a linear fash- 
ion inside the GPU's memory, are copied to the 
host memory. Afterwards, MPI messaging is 
used to exchange buffer contents between adja- 
cent blocks, and the buffer contents are copied 
back to the GPU memory. The final step invokes 
another kernel to distribute this buffer data to the 
ghost cells on the local mesh. 

4. Performance results 

Our focus in this section will be on the perfor- 
mance differences between the CPU-based THOR 
code and the GPU-based HORIZON implemen- 
tation described in this paper. Our intent is to 
give practical insight into what can be gained 
from porting a GRMHD code to GPUs: it should 
be clear that the particular numbers stated here 
are a consequence of (a) the particular hardware 
used, (b) our particular implementation of the 
CPU-based THOR code, and (c) our particular im- 
plementation of the GPU-based HORIZON code. 
Specifically, it is possible that the CPU-based 
code, which is based on Fortran 90, could be fur- 
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ther optimized. The results should therefore be 
considered a guide to order-of-magnitude perfor- 
mance gains a practitioner can expect from using 
GPUs for relativistic flow problems. 

Nevertheless, we have endeavored to perform 
a fair comparison between the CPU and GPU 
codes by considering a number of different fac- 
tors: Firstly, the THOR code has more function- 
ality than the HORIZON code in that it supports 
multiblock meshes. In particular, this requires the 
evaluation of local Jacobians and their derivatives 
in several places. For purposes of the compar- 
isons, we have deactivated this functionality en- 
tirely. 

Secondly, THOR is part of the CACTUS com- 
putational infrastructure, while HORIZON is a 
standalone C++ application. Since CACTUS is a 
powerful and general environment for hosting 
a variety of very different modules with poten- 
tially complex interactions, it is quite possible 
that, even when we primarily use only one such 
module, the general computational overhead due 
to the infrastructure skews the comparison. We 
therefore normalize all numbers obtained from 
THOR by comparing to an empty evolution us- 
ing CACTUS in the sense that all computations in- 
volving THOR are switched off In practice, we 
have found this overhead to be small. 

Thirdly, we disregard time spent in the ini- 
tial data setup for each code in the comparisons, 
and rather compare the wall clock times spend 
on the main evolution loop. This is sensible 
since benchmark comparisons are made for very 
few iterations, whereas actual simulations run 
much longer, and therefore quickly amortize ini- 
tial setup cost in typical cases. This is certainly 
true for setup cost when we use shock tubes and 
RNS data. A comparison which is predictive for 
long production simulations should therefore ex- 
clude the setup costs. 

Most tests here did not involve time spent 
in writing data to disk, i.e. we have switched 
off output. The reason is that output is entirely 
limited by the (slow) disk access time, particu- 
larly when writing large quantities of data, and 



2 In fact, this skews the numbers in disfavor of HORIZON, since 
the calculations for the time update using the method of lines 
are not regarded in this way, but they will be counted in the 
Horizon result. 



this factor can be very different between hard- 
ware implementation and particular simulation 
parameters. It is clear that a simulation which 
constantly writes a very large amount of data 
to a disk, e.g. to produce an animation, will 
find that transfer a bottleneck for performance, 
and by Amdahl's law one would see little differ- 
ence between CPU and GPU implementations. 
However, in more typical situations, output is re- 
stricted to norms and integral quantities every 
few time steps, and those should not have a large 
performance impact, provided the fast parallel 
reductions discussed above are used on the GPU. 
We will present one test which estimates the per- 
formance impact of disk output in a typical ap- 
plication situation. 

The usual approach to measure relative GPU 
to CPU performance is by comparing a serial 
code running on the CPU with a parallel code 
running on the GPU, resulting in the so-called 
parallel speedup. This is a standard specification of 
scaling behavior used in parallel environments, 
and it is simply defined as the ratio of the wall- 
clock time spent on the serial simulation with the 
wall-clock time in the parallel simulation. Since 
this is the number which is being used in almost 
all literature on GPU performance, we will also 
report speedup measurements in this sense here. 

However, modern CPUs are of course also 
mildly parallel, using typically four cores in 
present implementations, and therefore a more 
fair comparison could be to compare the speedup 
of the GPU-parallelized code with a CPU-parallelized 
code running on four cores. In fact, we think this 
is a comparison which should be more relevant 
in practice, since modern lU-sized cluster nodes 
usually have either four (quad-core) CPUs or four 
GPUs, and therefore such a comparison could re- 
flect actual wall-clock time differences in practice 
more faithfully. Therefore, we will also report 
these performance results in the following. 

The CPU simulations have been performed on 
an Intel Xeon E5620 inside a dual-socket work- 
station. GPU architectures used were an NVIDIA 
Geforce GTX 580, and a dual NVIDIA Tesla C2070 
workstation for MPI tests. 
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4.1. Timing results 

While performance comparisons could be 
done with almost arbitrary data, we will con- 
sider a setup which is close to what we use in 
practice: the simulation of a rapidly rotating neu- 
tron star model. This is used e.g. for neutron 
star asteroseismology < |Zink et al.||2010} , and as 
such the results stated here should be indicative 
of real-world performance of GPUs in actual sci- 
entific applications. In this section we will focus 
on relative performance, while we consider the 
different accuracy of single and double precision 
calculations in section [5] 

The initial data is generated using the RNS 
code (Stergio ulas & Friedman|l995[ |: it is a stan- 
dard uniformly rotating neutron star model with 
a polytropic stratification (model BU2 from Dim- 
melmeier et al. (2006)). We will not discuss the 



details of this model since they are not important 
for purposes of a performance analysis, but only 
state that it is based on a polytropic index N — 1, 
has a mass of 1 .46 Mq and a rotational period of 
about 2 ms. 

We perform 100 iterations of the simulation at 
grid resolutions of 60 x 60 x 60, 90 x 90 x 90, and 
120 x 120 x 120. We do not perform output to 
disk, and we normalize for initial data setup and 
computational overhead (the latter only in the 
CPU case). 

Table [T] shows the wall-clock time needed to 
perform this problem on either one CPU core, 
four CPU cores (both double precision accuracy), 
on the GPU in double precision, and on the GPU 
in single precision accuracy. Overall, the GPU 





60 3 


Grid size 
90 3 


120 3 


CPU (1 core) 


372.9 s 


1278.2 s 


3110.8 s 


CPU (4 cores) 


101.3 s 


326.4 s 


785.2 s 


GPU (double pr.) 


9.1s 


30.3 s 


60.1 s 


GPU (single pr.) 


2.1s 


6.7 s 


12.0 s 



Table 1: Rapidly rotating neutron star model: per- 
formance comparison between the CPU-based 
THOR code and the GPU-based HORIZON code 
for different problem sizes. All times are wall- 
clock times for 100 evolution steps. 



offers a substantial increase in performance over 
both single- and quad-core evolutions performed 
on the CPU. 

The associated speedup factors are reported in 
table [2] It is evident that the problem size has a 
substantial influence on the relative performance, 
an observation we have already made in the case 
of an Einstein code ( Zink 20081. It can be expected 
that a GPU, which is architecturally based on run- 
ning as many concurrent threads as possible to 
hide memory access latency, will perform better 
on more threads (i.e. larger problem sizes). Nev- 
ertheless, part of the increased ratio could also be 
attributed to an adverse scaling of the CPU code 
with the problem size. 
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Fig. 1. — Cost per cell update for different prob- 
lem sizes. The upper panel shows the CPU cost 
in microseconds, while the lower panel shows 
GPU cost (single precision) in nanoseconds. The 
CPU performance is almost independent of the 
problem size, while the GPU performs better with 
larger grids. 

To investigate this, we calculate the computa- 
tional cost per cell and iteration, and report the 
results in figure [T] A simple assumption of a ma- 
chine's performance scaling (that is, when we as- 
sume that only operation count is relevant for the 
cost) would imply that is number is almost con- 
stant when varying the grid size. From the plot, 
it is apparent that the CPU code almost has this 
property. However, the GPU becomes more effi- 
cient with larger problem size (i.e. a larger num- 
ber of threads). It is therefore advisable to satu- 
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rate the parallel processor as much as possible to 
obtain improved overall performance. 

4.2. Relevance of disk output cost 

In this section, we consider the impact of disk 
output on the performance result. As discussed 
before, writing arbitrarily large quantities of data 
to disk would make the disk access the bottleneck 
in the problem, and in such a case neither CPU 
nor GPU speed may be very relevant. In contrast, 
we select a set of output parameters typical for 
our own scientific simulations ( Zink et aLp OlO 



and observe the performance impact of the out- 
put. For this test, we evolve a neutron star for 
a time interval of 1 ms with a time step of 0.5 ^s. 
Output of overall norms (maximal value of den- 
sity and similar) is performed every 5 }is, whereas 
one-dimensional profile output along each axis, 
and also three-dimensional output of most evo- 
lution quantities, is performed every 50 /;s. The 
test is performed on a regular workstation with a 
SATA disk: a dedicated high-performance RAID 
system is expected to exhibit a smaller impact on 
performance. We report the results of the com- 
parison in table [3] As can be seen, output does 
not substantially affect performance in this typi- 
cal application case. 

4.3. Performance on multiple GPUs 

HORIZON supports MPI parallelization, and 
therefore it is interesting to investigate the cost of 
scaling to multiple GPUs. At the time of writing, 
we have access to a dual Tesla C2070 setup inside 
a single workstation, and the following obser- 
vations will therefore compare performance be- 
tween a single and two GPUs. 

We have tested the MPI scaling performance of 
HORIZON using a grid size of 120 x 120 x 120 for 
a single GPU, and 240 x 120 x 120 for two GPUs. 
This choice comprises a weak scaling test, which 
should be appropriate in the present context since 
it excludes the grid-size dependent performance 
of the GPU (see figure [T]| from the consideration 
of MPI messaging overhead measurements. The 
actual initial data is of little consequence in the 
measurement: because of the grid setup, we have 
opted for a Riemann problem. 

The results of the timing tests are reported in 
table |4] As for the disk output case, the parallel 





60 3 


VJllU. OIZjC 

90 3 


120 3 


GPU (SP) over 
CPU (1 core) 


177.7 x 


191.6 x 


259.2 x 


GPU (DP) over 
CPU (1 core) 


41.0 x 


42.2 x 


51.7 x 


GPU (SP) over 
CPU (4 cores) 


48.2 x 


48.9 x 


65.4 x 


GPU (DP) over 
CPU (4 cores) 


11.1 X 


10.8 x 


13.1 x 



Table 2: Relative performance (speedup) derived 
from table [T] The first row is the speedup mea- 
sure usually employed in most GPU publications. 
However, the third and fourth rows, each com- 
paring either a GPU simulation in single preci- 
sion (SP) or double precision (DP) floating point 
accuracy to a quad-core CPU simulation, should 
be more relevant for a practical assessment of 
GPU over CPU performance. 





GPU (SP) 


GPU (DP) 


No output 


132.50 s 


635.54 s 


With output 


134.06 s 


637.46 s 


Relative cost 


1.17% 


0.30 % 



Table 3: Impact of disk output on GPU simula- 
tion performance, using an output frequency typ- 
ical of a simulation performed in neutron star as- 
teroseismology. GPU simulations have been per- 
formed in both single (SP) and double precision 
(DP), with the expectation that the relative cost is 
lower in double precision. 





SP 


DP 


1GPU 


17.74 s 


79.81 s 


2 GPUs (MPI) 


18.38 s 


80.94 s 


Parallel efficiency 


96.5 % 


98.6 % 



Table 4: Cost of MPI parallelization (parallel effi- 
ciency), comparing a simulation running on two 
GPUs with a single GPU simulation, both in sin- 
gle (SP) and double precision (DP). This is a non- 
trivial test for GPUs even when only two MPI 
processes are used. 
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synchronization is a serial point inside the par- 
allel program, and therefore could be a limiting 
factor by Amdahl's law. However, in contrast 
to disk output, the MPI synchronization needs 
to be performed at every substep of the method 
to have a consistent global state. In principle, 
this could give rise to a substantial reduction in 
overall performance, but in the present case we 
observe good scaling efficiency. These results, 
while encouraging, are of course only partially 
indicative of parallel efficiency on a full cluster. 
However, given a particular (large) problem size, 
substantially less GPUs are needed to obtain the 
same performance as CPUs, thus requiring less 
MPI processes in comparison. 

5. Comparison between single and double 
precision accuracy 

This section will repeat selected standard test 
cases for GRMHD already performed with THOR 
using the GPU-accelerated HORIZON code, with 
a focus on differences between single and dou- 
ble precision floating point accuracy. This is of 
particular interest as a consequence of the perfor- 
mance gap between these options, and whether 
a hybrid programming model, i.e. one that uses 
single precision accuracy in most operations and 
double precision in only selected ones, could be 
feasible. The results reported here use either pure 
single or double precision accuracy operations. 

5.1. Shock tubes 



We will first consider the Sod test ( |Sod||1978) , 
which is a standard test case for compressible hy- 
drodynamics codes. We prepare a Riemann prob- 
lem in an ideal gas with T = 5/3, and with initial 
states pi = 1, Pi = 1, u l L = 0, and pr = 0.125, 
Pr = 0.1, u' R = 0. We prepare this problem on 
a full 120 x 120 x 120 grid using HORIZON, and 
simulate the evolution up to a time t = 0.8. For 
comparison, a reference solution using THOR has 
been produced using a very fine grid. 

Figures [2] and [3] show the density and x- 
velocity profiles at the end of the evolution. The 
profiles show the somewhat dissipative nature of 
the HLL solver and TVD limiter we are using, but 
in particular, they show no discernible difference 
between single and double precision results. In 
fact, these differences are \Sp\ « 5 x 10~ 5 and 
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Fig. 2. — Sod test: comparison between single and 
double precision accuracy. This plot shows the 
density profile at time t = 0.8. The GPU results 
were obtained with HORIZON, whereas the refer- 
ence solution is produced with THOR on a very 
fine grid. 



\8u* 



10 4 exactly at the location of the shock, 



and typically \Sp\ < 1 x 10" 6 and \Su x \ < 10" 
in smooth parts of the flow. These errors are at or 
lower than the level of the discretization error of 
the scheme. 

A set of tests for magnetized flows have been 
proposed by Balsara (2001 1. We will show results 
from the first Balsara test, with initial data pi = 1, 
P L = 1, 4 = 0, B X L = 0.5, B{ =1,B z l = 0, and 
Pr = 0.125, P R = 0.1, u R = 0, B X R = 0.5, B R = -1, 
B R = 0. The setup is otherwise the same as for 
the Sod test. 

In figure |4] we show a comparison of the mag- 
netic field vector component B X J (which is initially 
discontinuous) at the end of the evolution, as a 
representative quantity for comparison. As in the 
case of the Sod test, also the magnetic field struc- 
ture is not significantly affected by using single 
precision accuracy, at least in this test case. The 
absolute differences between single and double 
precision are \5BV\ < 10~ 6 , and similar state- 
ments hold for all other evolved quantities. 

5.2. Oscillations in a rapidly rotating neutron 
star 

A next comparison between single and dou- 
ble precision accuracy will be performed using 
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Fig. 3. — Same as figure [2j but comparing the x 
velocity u x . 



a slightly perturbed stellar model, 
model BU2 as in section 



4.1 



We use the 
but additionally per- 
turb it with a small quadrupole deformation {I = 
2, m = 0) in order to analyze the dynamics of the 
star. 

The tests are performed with a moderate reso- 
lution (90 x 90 x 90, with about 60 - 70 cells cov- 
ering the stellar diameter), and we evolve the star 
for 5 ms physical time, corresponding to 10,000 
iterations. Figure [5] shows a comparison of the 
central density evolution using single and double 
precision simulations. As can be seen, the differ- 
ences are very small, and likely smaller than the 
discretization error of the system. When trans- 
forming both time profiles into Fourier space (not 
shown here), we similarly get virtually identi- 
cal spectra, and can identify the fundamental 
radial mode F at 2596 ± 30 Hz, its overtone Hj 
at 4403 ± 30 Hz, the fundamental quadrupole 
mode 2 f at 1909 ± 30 Hz, and its overtone 2 pi 
at 3970 ± 30 Hz. These numbers match very well 
with Gaertig & Kokkotas (2008), but specifically, 
there is no discernible difference between the sin- 
gle and double precision results. 

Another very sensitive quantity is the rota- 
tional velocity profile of the star. In figure |6j we 
compare the x-axis profile of ^(x)^] after 5 ms. 



Noble et al. 



2006 



3 We use the primitive velocity variables of 
in which the four- velocity ill', the Lorentz factor 7, the lapse 
function a and the shift vector f>' induce spatial veloc ities u' = 
u' + yfS'/x, which are related to the quantities v' {Banyuls 



Fig. 4. — Balsara 1 test: comparison between 
single and double precision accuracy. This plot 
shows the magnetic field ffl profile at time t = 
0.8. The GPU results were obtained with HORI- 
ZON, whereas the reference solution is produced 
with THOR on a very fine grid. 



Again, virtually no differences between single 
and double precision results are apparent, even 
in the pro blematic reg ion near the stella r surface. 
The maximal deviation is \SW\ < 5 x 10 _D . 



We conclude that, at least for the applica 
tions presented here, single precision calculations 



should be sufficient. However, as with all other 



parameters entering a numerical calculation (dis- 
cretization scheme, time step, spatial resolution, 



.), this assumption must be tested on a case- 
by-case basis. We will further discuss this point 
below. 



5.3. Strongly magnetized neutron star 



As a final point of comparison we explore a 
magnetized neutron star model. For this test, we 



use the nonrotating model BUO <|Dimm elmeier 



et al.||2006|, which is defined by p c = 1.28e~ 



and r — 2 and has a mass of M = 1.4 M . On 



this model, we superimpose a toroidal magnetic 
field given by (|Braithwaite|2006) 



B = B n 



cd y 



2 e- z/2H ef 



(4) 



|et al.|1997) used in some other GRMHD codes by the Lorentz 
factor. However, this distinction should not be important in 
the present context. 
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Fig. 5. — Oscillations of a rapidly rotating neu- 
tron star model: comparison between single and 
double precision evolutions using the HORIZON 
code. This plot shows the central density of the 
star over a time of 5 ms. 



Fig. 6. — Oscillations of a rapidly rotating neu- 
tron star model: comparison between single and 
double precision evolutions using the HORIZON 
code. This plot shows the evolved rotational ve- 
locity profile 0(x) after an evolution of 5 ms. 



where CD = \J x 1 + y 2 , and we choose cDq = 
3, H = 1, and Bo such that the maximal field 
strength is approximately 2.6 x 10 15 G. The struc- 
ture of the initial magnetic field is shown in fig- 
ures [7]and[8] 

Even with these comparatively high field 
strengths, the magnetic field is only a small per- 
turbation to the fluid equilibrium. It is also 
known that this field is unstable Braithwaite 
( |2006) ; [Lander & Jones| ( |2^T^ ; pueFet~^ ( 12010) , 
but we will only be concerned here with differ- 
ences between single and double precision evo- 
lutions. As in the previous section, we evolve the 
model for 5 ms physical time on a 90 x 90 x 90 
grid. 

Figure [9] shows the profile BV(x) at the end of 
the evolution from both simulations. The dis- 
agreement is within \SB\ < 10 12 G, and therefore 
larger than in the non-magnetized case, but still 
well below 10 -3 of the actual field strength. 

6. Discussion 

This paper describes an implementation of 
general relativistic magnetohydrodynamics on 
graphics processing units, the HORIZON code, 
and presents results on the relative performance 
practitioners can expect to gain from employing 
GPUs in such a context. The main results from 



performance measurements, obtained in a setup 
which is close to applications in neutron star as- 



teroseismology, are summarized in figure 10 



We have argued that general relativistic mag- 
netohydrodynamics should map very well to 
GPUs as a consequence of the data-parallel na- 
ture of the problem, mostly regular memory 
access patterns, and a higher arithmetic inten- 
sity than is common in traditional Newtonian 
MHD codes. Timing tests have confirmed this 
expectation, and have shown substantial perfor- 
mance gains over current CPU architectures. We 
have selected a particular CPU /GPU combina- 
tion for these tests, but this general picture will 
not change when comparing different sets of cur- 
rent hardware implementations, even though the 
actual speedup numbers will vary. 

From an astrophysical application perspective, 
we can expect about an order of magnitude in- 
crease in performance by employing a double 
precision GPU calculations in place of a quad- 
core CPU, and several orders of magnitude when 
employing only single precision arithmetics. This 
is a direct consequence of the massively parallel, 
high-throughput nature of the GPU, and the fact 
that GRMHD maps well to this kind of architec- 
ture. 

Given the often very demanding problems 
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Fig. 7. — Strongly magnetized neutron star: Ini- 
tial distribution of the amplitude of the magnetic 
field within a meridional cut, with contours up 
to about 2 x 10 15 G. The dashed line denotes the 
stellar surface. 

which result from implementation of astrophysi- 
cal models, and the high cost of three-dimensional 
explicit simulations, an order of magnitude in 
performance should be enough of an incentive to 
consider GPUs, and similar massively parallel ar- 
chitectures, in place of CPUs. Such a performance 
gain is interesting to attack substantially larger 
simulations (or longer evolution times) on com- 
pute clusters, provided a GPU-based machine is 
available. 

However, another use of GPUs should be men- 
tioned which is no less important in practice: 
the ability to perform reasonably sized full-scale 
simulations on a single workstation, with a high 
turnaround time for results. This is relevant to ex- 
plore new numerical methods and model param- 
eters spaces, which can now be performed much 
faster, and much more conveniently, than using 
many nodes on a cluster resource. In this way, 
scientific productivity can be gained very easily 
with a modest investment in hardware. 

The much higher throughput obtained in sin- 
gle precision computations begs the question 
whether computations could also be performed 



Fig. 8. — Same as figure [7j but inside the equato- 
rial plane of the star. 

with a reduced level of accuracy. We would first 
like to stress that this aspect of course also applies 
to the distinction between double and quadru- 
ple precision, and there may well be problems 
for which even double precision is not sufficient. 
Overall, this is entirely dependent on the particu- 
lar problem and numerical algorithm employed, 
and only experimentation can establish whether 
a particular floating point approximation is suffi- 
cient. Of course, this equally applies to all other 
free parameters of the discretization. 

In the particular cases we have investigated 
here, which consider shock tubes, small oscilla- 
tions of rotating stars, and strongly magnetizes 
stars, it seems that single precision floating point 
accuracy could be sufficient for a number of mod- 
eling purposes. That statement may well cease 
to be true if tabulated equations of state, tur- 
bulence, radiation transport, reaction networks 
etc. are involved in a calculation. If such is 
the case, a good strategy could be to locate (ap- 
proximately) the components of the numerical 
code which should require double precision ac- 
curacy, and then employ the higher accuracy only 
in these parts. HORIZON offers exactly this option 
by allowing to adjust the accuracy of the conser- 
vative to primitive variable transformation and 
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Fig. 9. — Strongly magnetized neutron star: Com- 
parison between single and double precision evo- 
lutions. This plot shows the profile BV(x) after an 
evolution of 5 ms. 

the calculation of flux differences. Such a hybrid 
approach could preserve very high levels of per- 
formance while keeping the required level of ac- 
curacy. 

But the most important aspect of GPU com- 
putation, and more generally, massively data- 
parallel hardware architectures (which may also 
be located on a CPU), are not the potential gains 
which could be achieved at the present time: it 
is the fact that the peak performance in these 
hardware designs grows much more rapidly than 
the peak performance of traditional CPUs. That 
is, the relative factor of 10 we can achieve to- 
day may translate into a factor of 20 in two 
years time. Preparing for this major shift in 
high-performance architecture, which is not only 
driven by the isolated requirements of a small 
market segment as in the past, but pervades 
the global development of all consumer and 
professional-level microprocessors, should be 
considered a priority for future computational 
tools in astrophysics. 
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Fig. 10. — Relative performance of the GPU- 
based HORIZON code compared to the CPU- 
based THOR code. These numbers approximately 
represent the timing results obtained in section |4j 
which have been measured with an Intel Xeon 
E5620 CPU and an NVIDIA Geforce GTX580 
GPU. The first number is the one typically used 
in publications on GPU codes, whereas the third 
and fourth row show more realistic comparisons 
between a parallel code running on a GPU and a 
parallel code running on a quad-core CPU. In sec- 
tion [5] we have presented results indicating that 
single precision accuracy may be sufficient for a 
number of applications. Further discussion of 
this point can be found in the text. 
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